{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用Video MAE进行快速推理\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load video 加载视频"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们从[Kinetics-400数据集](https://www.deepmind.com/open-source/kinetics)中加载一个视频。该数据集包含数百万个YouTube视频，每个视频都被标注为400种可能类别中的一种。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-10-11 10:58:58--  https://huggingface.co/datasets/nielsr/video-demo/resolve/main/eating_spaghetti.mp4\n",
      "Connecting to 172.26.1.26:12798... connected.\n",
      "Proxy request sent, awaiting response... 302 Found\n",
      "Location: https://cdn-lfs.hf.co/repos/21/27/2127ba3909eec39f0c04aa658b6aa97c12af51427ff415d000d565c97e36724b/252f63d13748f08acf56765c295506bfdb8bb73b822e93a33a57d73988814a71?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27eating_spaghetti.mp4%3B+filename%3D%22eating_spaghetti.mp4%22%3B&response-content-type=video%2Fmp4&Expires=1728874740&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyODg3NDc0MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy8yMS8yNy8yMTI3YmEzOTA5ZWVjMzlmMGMwNGFhNjU4YjZhYTk3YzEyYWY1MTQyN2ZmNDE1ZDAwMGQ1NjVjOTdlMzY3MjRiLzI1MmY2M2QxMzc0OGYwOGFjZjU2NzY1YzI5NTUwNmJmZGI4YmI3M2I4MjJlOTNhMzNhNTdkNzM5ODg4MTRhNzE%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=EzznF3jFk-X9cigZ-9n7MztXv2LsEIH6wMJLqsFaXn5Rqf-Ito14neMF-IZNnUBVCXLqHp2ZofWutpKgamp7V6xNf%7E8UGvVl0yF40MHeeICC%7EiMzhGOMbnPA8fowyejxc-yKfn3TVSlQaeBKFGfobUjMc0rmfzLu2KVdqay74vZf1N6AfUP7zDD-vvqIwdqG9GsuY%7ErCNxaCGTqQeqhpoCRtK6abFvB7ywuEvI-JUdh7GFy5wX9-2lULqaueSkEyLmPeYkY6WF5LcOg-42rjFhO4460ArD-SbpE3gClCHwbHbHGkfvy6mTVyNmDmvW1M5cmFmkAH3QxCyPSSPoPwQw__&Key-Pair-Id=K3RPWS32NSSJCE [following]\n",
      "--2024-10-11 10:59:00--  https://cdn-lfs.hf.co/repos/21/27/2127ba3909eec39f0c04aa658b6aa97c12af51427ff415d000d565c97e36724b/252f63d13748f08acf56765c295506bfdb8bb73b822e93a33a57d73988814a71?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27eating_spaghetti.mp4%3B+filename%3D%22eating_spaghetti.mp4%22%3B&response-content-type=video%2Fmp4&Expires=1728874740&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyODg3NDc0MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy8yMS8yNy8yMTI3YmEzOTA5ZWVjMzlmMGMwNGFhNjU4YjZhYTk3YzEyYWY1MTQyN2ZmNDE1ZDAwMGQ1NjVjOTdlMzY3MjRiLzI1MmY2M2QxMzc0OGYwOGFjZjU2NzY1YzI5NTUwNmJmZGI4YmI3M2I4MjJlOTNhMzNhNTdkNzM5ODg4MTRhNzE%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=EzznF3jFk-X9cigZ-9n7MztXv2LsEIH6wMJLqsFaXn5Rqf-Ito14neMF-IZNnUBVCXLqHp2ZofWutpKgamp7V6xNf%7E8UGvVl0yF40MHeeICC%7EiMzhGOMbnPA8fowyejxc-yKfn3TVSlQaeBKFGfobUjMc0rmfzLu2KVdqay74vZf1N6AfUP7zDD-vvqIwdqG9GsuY%7ErCNxaCGTqQeqhpoCRtK6abFvB7ywuEvI-JUdh7GFy5wX9-2lULqaueSkEyLmPeYkY6WF5LcOg-42rjFhO4460ArD-SbpE3gClCHwbHbHGkfvy6mTVyNmDmvW1M5cmFmkAH3QxCyPSSPoPwQw__&Key-Pair-Id=K3RPWS32NSSJCE\n",
      "Connecting to 172.26.1.26:12798... connected.\n",
      "Proxy request sent, awaiting response... 200 OK\n",
      "Length: 1013655 (990K) [video/mp4]\n",
      "Saving to: ‘eating_spaghetti.mp4’\n",
      "\n",
      "eating_spaghetti.mp 100%[===================>] 989.90K   304KB/s    in 3.3s    \n",
      "\n",
      "2024-10-11 10:59:04 (304 KB/s) - ‘eating_spaghetti.mp4’ saved [1013655/1013655]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://huggingface.co/datasets/nielsr/video-demo/resolve/main/eating_spaghetti.mp4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2e92801ebfd94ada90dcf175c561a222",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Video(value=b'\\x00\\x00\\x00 ftypisom\\x00\\x00\\x02\\x00isomiso2avc1mp41\\x00\\x00\\x00\\x08free...', width='500')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ipywidgets import Video\n",
    "\n",
    "video_path = \"eating_spaghetti.mp4\" \n",
    "Video.from_file(video_path, width=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Prepare video for model 为模型准备视频"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以通过使用VideoMAEFeatureExtractor为模型处理视频。首先，从最多300帧中采样16帧，并将这些帧输入特征提取器。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "它将进行一些基本的预处理操作，包括对视频的每一帧进行调整大小、中心裁剪以及归一化处理。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache /tmp/jieba.cache\n",
      "Loading model cost 1.009 seconds.\n",
      "Prefix dict has been built successfully.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "06fc8cee68b441e59572c31d80a95124",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0.00/271 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/miniconda3/envs/py39/lib/python3.9/site-packages/mindnlp/transformers/models/videomae/feature_extraction_videomae.py:28: FutureWarning: The class VideoMAEFeatureExtractor is deprecated. Please use VideoMAEImageProcessor instead.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from mindnlp.transformers import VideoMAEFeatureExtractor\n",
    "\n",
    "feature_extractor = VideoMAEFeatureExtractor.from_pretrained(\"MCG-NJU/videomae-base-finetuned-kinetics\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(16, 360, 640, 3)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from decord import VideoReader, cpu\n",
    "import numpy as np\n",
    "\n",
    "# video clip consists of 300 frames (10 seconds at 30 FPS)\n",
    "vr = VideoReader(video_path, num_threads=1, ctx=cpu(0)) \n",
    "\n",
    "def sample_frame_indices(clip_len, frame_sample_rate, seg_len):\n",
    "  converted_len = int(clip_len * frame_sample_rate)\n",
    "  end_idx = np.random.randint(converted_len, seg_len)\n",
    "  str_idx = end_idx - converted_len\n",
    "  index = np.linspace(str_idx, end_idx, num=clip_len)\n",
    "  index = np.clip(index, str_idx, end_idx - 1).astype(np.int64)\n",
    "  \n",
    "  return index\n",
    "\n",
    "vr.seek(0)\n",
    "index = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(vr))\n",
    "buffer = vr.get_batch(index).asnumpy()\n",
    "buffer.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 16, 3, 224, 224)\n"
     ]
    }
   ],
   "source": [
    "# create a list of NumPy arrays\n",
    "video = [buffer[i] for i in range(buffer.shape[0])]\n",
    "\n",
    "encoding = feature_extractor(video, return_tensors=\"ms\")\n",
    "print(encoding.pixel_values.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Load model 加载模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，让我们从hub中加载模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "60ead92d5b7546b789aa584e3467bc56",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fcb407f2f8e94337ade31c149f35b16c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0.00/1.13G [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from mindnlp.transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification\n",
    "\n",
    "model = VideoMAEForVideoClassification.from_pretrained(\"MCG-NJU/videomae-large-finetuned-kinetics\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Forward pass 前向传播"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "pixel_values = encoding.pixel_values\n",
    "\n",
    "# forward pass\n",
    "outputs = model(pixel_values)\n",
    "logits = outputs.logits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicted class: eating spaghetti\n"
     ]
    }
   ],
   "source": [
    "predicted_class_idx = logits.argmax(-1).item()\n",
    "\n",
    "print(\"Predicted class:\", model.config.id2label[predicted_class_idx])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py39",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
